1. Load packages and dependencies¶
# Dependencies
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
# Jupyter Notebook Options
plt.style.use('ggplot')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 500)
pd.set_option('display.expand_frame_repr', False)
%matplotlib inline
warnings.filterwarnings('ignore')
2. About the dataset¶
2.1 Dataset Loading¶
col_names = ['id','cycle','setting1','setting2','setting3','s1','s2','s3','s4','s5','s6','s7','s8','s9',
's10','s11','s12','s13','s14','s15','s16','s17','s18','s19','s20','s21']
parameters_dict = {
'id':'engine',
's1':"(Fan inlet temperature) (◦R)",
's2':"(LPC outlet temperature) (◦R)",
's3':"(HPC outlet temperature) (◦R)",
's4':"(LPT outlet temperature) (◦R)",
's5':"(Fan inlet Pressure) (psia)",
's6':"(bypass-duct pressure) (psia)",
's7':"(HPC outlet pressure) (psia)",
's8':"(Physical fan speed) (rpm)",
's9':"(Physical core speed) (rpm)",
's10':"(Engine pressure ratio) (P50/P2)",
's11':"(HPC outlet Static pressure) (psia)",
's12':"(Ratio of fuel flow to Ps30) (pps/psia)",
's13':"(Corrected fan speed) (rpm)",
's14':"(Corrected core speed) (rpm)",
's15':"(Bypass Ratio) ",
's16':"(Burner fuel-air ratio)",
's17':"(Bleed Enthalpy)",
's18':"(Required fan speed)",
's19':"(Required fan conversion speed)",
's20':"(High-pressure turbines Cool air flow)",
's21':"(Low-pressure turbines Cool air flow)"
}
df_train_raw = pd.read_csv('PM_train.txt', sep = ' ', header=None)
df_train_raw.drop([26,27], axis=1, inplace=True)
df_train_raw.columns = col_names
df_train_raw.head()
| | id | cycle | setting1 | setting2 | setting3 | s1 | s2 | s3 | s4 | s5 | s6 | s7 | s8 | s9 | s10 | s11 | s12 | s13 | s14 | s15 | s16 | s17 | s18 | s19 | s20 | s21 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | -0.0007 | -0.0004 | 100.0 | 518.67 | 641.82 | 1589.70 | 1400.60 | 14.62 | 21.61 | 554.36 | 2388.06 | 9046.19 | 1.3 | 47.47 | 521.66 | 2388.02 | 8138.62 | 8.4195 | 0.03 | 392 | 2388 | 100.0 | 39.06 | 23.4190 |
| 1 | 1 | 2 | 0.0019 | -0.0003 | 100.0 | 518.67 | 642.15 | 1591.82 | 1403.14 | 14.62 | 21.61 | 553.75 | 2388.04 | 9044.07 | 1.3 | 47.49 | 522.28 | 2388.07 | 8131.49 | 8.4318 | 0.03 | 392 | 2388 | 100.0 | 39.00 | 23.4236 |
| 2 | 1 | 3 | -0.0043 | 0.0003 | 100.0 | 518.67 | 642.35 | 1587.99 | 1404.20 | 14.62 | 21.61 | 554.26 | 2388.08 | 9052.94 | 1.3 | 47.27 | 522.42 | 2388.03 | 8133.23 | 8.4178 | 0.03 | 390 | 2388 | 100.0 | 38.95 | 23.3442 |
| 3 | 1 | 4 | 0.0007 | 0.0000 | 100.0 | 518.67 | 642.35 | 1582.79 | 1401.87 | 14.62 | 21.61 | 554.45 | 2388.11 | 9049.48 | 1.3 | 47.13 | 522.86 | 2388.08 | 8133.83 | 8.3682 | 0.03 | 392 | 2388 | 100.0 | 38.88 | 23.3739 |
| 4 | 1 | 5 | -0.0019 | -0.0002 | 100.0 | 518.67 | 642.37 | 1582.85 | 1406.22 | 14.62 | 21.61 | 554.00 | 2388.06 | 9055.15 | 1.3 | 47.28 | 522.19 | 2388.04 | 8133.80 | 8.4294 | 0.03 | 393 | 2388 | 100.0 | 38.90 | 23.4044 |
df_test_raw = pd.read_csv('PM_test.txt', sep = ' ', header=None)
df_test_raw.drop([26,27], axis=1, inplace=True)
df_test_raw.columns = col_names
df_test_raw.head()
| | id | cycle | setting1 | setting2 | setting3 | s1 | s2 | s3 | s4 | s5 | s6 | s7 | s8 | s9 | s10 | s11 | s12 | s13 | s14 | s15 | s16 | s17 | s18 | s19 | s20 | s21 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0.0023 | 0.0003 | 100.0 | 518.67 | 643.02 | 1585.29 | 1398.21 | 14.62 | 21.61 | 553.90 | 2388.04 | 9050.17 | 1.3 | 47.20 | 521.72 | 2388.03 | 8125.55 | 8.4052 | 0.03 | 392 | 2388 | 100.0 | 38.86 | 23.3735 |
| 1 | 1 | 2 | -0.0027 | -0.0003 | 100.0 | 518.67 | 641.71 | 1588.45 | 1395.42 | 14.62 | 21.61 | 554.85 | 2388.01 | 9054.42 | 1.3 | 47.50 | 522.16 | 2388.06 | 8139.62 | 8.3803 | 0.03 | 393 | 2388 | 100.0 | 39.02 | 23.3916 |
| 2 | 1 | 3 | 0.0003 | 0.0001 | 100.0 | 518.67 | 642.46 | 1586.94 | 1401.34 | 14.62 | 21.61 | 554.11 | 2388.05 | 9056.96 | 1.3 | 47.50 | 521.97 | 2388.03 | 8130.10 | 8.4441 | 0.03 | 393 | 2388 | 100.0 | 39.08 | 23.4166 |
| 3 | 1 | 4 | 0.0042 | 0.0000 | 100.0 | 518.67 | 642.44 | 1584.12 | 1406.42 | 14.62 | 21.61 | 554.07 | 2388.03 | 9045.29 | 1.3 | 47.28 | 521.38 | 2388.05 | 8132.90 | 8.3917 | 0.03 | 391 | 2388 | 100.0 | 39.00 | 23.3737 |
| 4 | 1 | 5 | 0.0014 | 0.0000 | 100.0 | 518.67 | 642.51 | 1587.19 | 1401.92 | 14.62 | 21.61 | 554.16 | 2388.01 | 9044.55 | 1.3 | 47.31 | 522.15 | 2388.03 | 8129.54 | 8.4031 | 0.03 | 390 | 2388 | 100.0 | 38.99 | 23.4130 |
df_truth = pd.read_csv('PM_truth.txt', sep = ' ', header=None)
df_truth.drop([1], axis=1, inplace=True)
df_truth.columns = ['ttf']
df_truth.head()
| | ttf |
|---|---|
| 0 | 112 |
| 1 | 98 |
| 2 | 69 |
| 3 | 82 |
| 4 | 91 |
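The `sep = ' '` calls above produce two empty trailing columns (26 and 27) that are then dropped, which suggests that each line of the files ends with extra spaces. As a sketch of an alternative, assuming that layout, a whitespace-regex separator avoids the extra columns altogether. The helper name `load_sensor_file` and the in-memory sample are illustrative only and are not part of the notebook.

```python
import io
import pandas as pd

COL_NAMES = (['id', 'cycle', 'setting1', 'setting2', 'setting3']
             + [f's{i}' for i in range(1, 22)])

def load_sensor_file(path_or_buffer):
    """Read a whitespace-delimited sensor file into a 26-column DataFrame."""
    # r'\s+' treats any run of whitespace as a single delimiter,
    # so trailing spaces do not create empty columns.
    return pd.read_csv(path_or_buffer, sep=r'\s+', header=None, names=COL_NAMES)

# Hypothetical sample: one row of values followed by two trailing spaces
sample = ("1 1 -0.0007 -0.0004 100.0 518.67 641.82 1589.70 1400.60 14.62 21.61 "
          "554.36 2388.06 9046.19 1.3 47.47 521.66 2388.02 8138.62 8.4195 0.03 "
          "392 2388 100.0 39.06 23.4190  \n")
df_sample = load_sensor_file(io.StringIO(sample * 2))
print(df_sample.shape)
```

With a real file, the call would be `load_sensor_file('PM_train.txt')`, and no `drop` step would be needed afterwards.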
2.2 Dataset description¶
Data Source¶
___Training Data:___ The aircraft engine run-to-failure data.
download training data
___Test Data:___ The aircraft engine operating data without failure events recorded.
download test data
___Ground Truth Data:___ The true remaining cycles for each engine in the testing data.
download truth data
Data Columns¶
• id: the engine ID, ranging from 1 to 100
• cycle: per-engine sequence number, starting at 1 and running to the cycle at which failure occurred (training data only)
• setting1 to setting3: engine operational settings
• s1 to s21: sensor measurements
Training data: there are 100 engines. The cycle counter runs from 1 to 362 (mean of about 108 over all rows), and the last cycle recorded for each engine is the cycle at which failure occurred.
Test data: as in the training data, there are 100 engines. The cycle counter runs from 1 to 303 (mean of about 76 over all rows), but this time the failure cycle is not provided.
Failure events for the test data, expressed as remaining cycles before failure (TTF), are provided in a separate truth file.
To get a meaningful test set, we need to merge the truth data (TTF) with the last cycle of each engine in the test data. This gives a test set of 100 engines with their TTF values. We will do that later, when we create the regression and classification labels for both training and test data.
For now, let us add features that smooth the sensor readings: a rolling average and a rolling standard deviation.
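The rolling average and rolling standard deviation features mentioned above can be sketched as follows. This is not the notebook's own implementation, which lives in a separate helper module. It assumes a per-engine window with `min_periods=1` and a standard deviation of 0 on each engine's first cycle. The function name `add_rolling_features` is hypothetical.

```python
import pandas as pd

def add_rolling_features(df, sensor_cols, window=5):
    """Return a copy of df with per-engine rolling mean (av*) and std (sd*) columns."""
    out = df.copy()
    by_engine = df.groupby('id')  # windows never span two engines
    for col in sensor_cols:
        n = col[1:]  # 's2' -> '2'
        out['av' + n] = by_engine[col].transform(
            lambda x: x.rolling(window, min_periods=1).mean())
        # The sample std of a single value is undefined (NaN); use 0 instead
        out['sd' + n] = by_engine[col].transform(
            lambda x: x.rolling(window, min_periods=1).std()).fillna(0)
    return out

# Small demo using a few s2 readings from the tables above
demo = pd.DataFrame({
    'id':    [1, 1, 1, 2, 2],
    'cycle': [1, 2, 3, 1, 2],
    's2':    [641.82, 642.15, 642.35, 643.02, 641.71],
})
demo_fx = add_rolling_features(demo, ['s2'], window=5)
print(demo_fx)
```

For engine 1, the second row gives av2 = 641.985 and sd2 of about 0.2333, which agrees with the av2 and sd2 values in the feature table shown later in this notebook.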
2.3 Feature Extraction¶
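import functions_library as fl
# Add rolling mean (av*) and rolling standard deviation (sd*) features; 5 is presumably the window size in cycles
df_train_fx = fl.add_features(df_train_raw, 5)
# Add the ttf, label_bnc and label_mcc columns; 30 is presumably the label window in cycles
df_train = fl.prepare_train_data(df_train_fx, 30)
df_train.to_csv('train.csv', index=False)
# Same steps for the test set, with the truth file supplying the TTF values
df_test_fx = fl.add_features(df_test_raw, 5)
df_test = fl.prepare_test_data(df_test_fx, df_truth, 30)
df_test.to_csv('test.csv', index=False)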
# Import train / test dataset
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
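The label columns can be sketched as follows. The helpers in `functions_library` are not shown here, so this is an assumption about their logic rather than the actual implementation: `ttf` is taken as the number of cycles left before an engine's last recorded cycle, `label_bnc` flags rows with `ttf <= 30`, and `label_mcc` additionally separates rows with `ttf <= 15`. These thresholds are consistent with the label means in the summary table later in the notebook (about 15% of rows have `label_bnc = 1`), but they remain an inference. For the test set, the sketch assumes the truth file is ordered by engine id.

```python
import pandas as pd

def add_train_labels(df, period=30):
    """Add ttf plus binary and multiclass labels to run-to-failure data."""
    out = df.copy()
    # Cycles remaining until the engine's last recorded cycle
    out['ttf'] = out.groupby('id')['cycle'].transform('max') - out['cycle']
    out['label_bnc'] = (out['ttf'] <= period).astype(int)
    # 0: ttf > period, 1: period // 2 < ttf <= period, 2: ttf <= period // 2
    out['label_mcc'] = out['label_bnc'] + (out['ttf'] <= period // 2).astype(int)
    return out

def last_cycle_with_truth(df_test, df_truth):
    """Keep the last cycle of each test engine and attach its true ttf."""
    last = (df_test.sort_values(['id', 'cycle'])
                   .groupby('id').tail(1)
                   .reset_index(drop=True))
    # Assumption: row i of the truth file belongs to the i-th engine id
    last['ttf'] = df_truth['ttf'].to_numpy()
    return last

# Demo: one engine that fails after 40 cycles
train_demo = pd.DataFrame({'id': [1] * 40, 'cycle': list(range(1, 41))})
labeled = add_train_labels(train_demo)

# Demo: two test engines with truth values for each
test_demo = pd.DataFrame({'id': [1, 1, 2, 2, 2], 'cycle': [1, 2, 1, 2, 3]})
truth_demo = pd.DataFrame({'ttf': [112, 98]})
last_demo = last_cycle_with_truth(test_demo, truth_demo)
print(labeled.tail(), last_demo, sep='\n')
```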
3. Exploratory Data Analysis¶
3.1 Univariate Analysis¶
The first step of a machine learning project is exploratory data analysis (EDA), which helps us grasp the main dynamics of the data. It also lays the groundwork for robust feature engineering.
train.head()
| | id | cycle | setting1 | setting2 | setting3 | s1 | s2 | s3 | s4 | s5 | s6 | s7 | s8 | s9 | s10 | s11 | s12 | s13 | s14 | s15 | s16 | s17 | s18 | s19 | s20 | s21 | av1 | av2 | av3 | av4 | av5 | av6 | av7 | av8 | av9 | av10 | av11 | av12 | av13 | av14 | av15 | av16 | av17 | av18 | av19 | av20 | av21 | sd1 | sd2 | sd3 | sd4 | sd5 | sd6 | sd7 | sd8 | sd9 | sd10 | sd11 | sd12 | sd13 | sd14 | sd15 | sd16 | sd17 | sd18 | sd19 | sd20 | sd21 | ttf | label_bnc | label_mcc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | -0.0007 | -0.0004 | 100.0 | 518.67 | 641.82 | 1589.70 | 1400.60 | 14.62 | 21.61 | 554.36 | 2388.06 | 9046.19 | 1.3 | 47.47 | 521.66 | 2388.02 | 8138.62 | 8.4195 | 0.03 | 392 | 2388 | 100.0 | 39.06 | 23.4190 | 518.67 | 641.820000 | 1589.700000 | 1400.600000 | 14.62 | 21.61 | 554.360000 | 2388.0600 | 9046.190000 | 1.3 | 47.470 | 521.660 | 2388.020 | 8138.620000 | 8.419500 | 0.03 | 392.000000 | 2388.0 | 100.0 | 39.060000 | 23.419000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 191 | 0 | 0 |
| 1 | 1 | 2 | 0.0019 | -0.0003 | 100.0 | 518.67 | 642.15 | 1591.82 | 1403.14 | 14.62 | 21.61 | 553.75 | 2388.04 | 9044.07 | 1.3 | 47.49 | 522.28 | 2388.07 | 8131.49 | 8.4318 | 0.03 | 392 | 2388 | 100.0 | 39.00 | 23.4236 | 518.67 | 641.985000 | 1590.760000 | 1401.870000 | 14.62 | 21.61 | 554.055000 | 2388.0500 | 9045.130000 | 1.3 | 47.480 | 521.970 | 2388.045 | 8135.055000 | 8.425650 | 0.03 | 392.000000 | 2388.0 | 100.0 | 39.030000 | 23.421300 | 0.0 | 0.233345 | 1.499066 | 1.796051 | 0.0 | 0.0 | 0.431335 | 0.014142 | 1.499066 | 0.0 | 0.014142 | 0.438406 | 0.035355 | 5.041671 | 0.008697 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.042426 | 0.003253 | 190 | 0 | 0 |
| 2 | 1 | 3 | -0.0043 | 0.0003 | 100.0 | 518.67 | 642.35 | 1587.99 | 1404.20 | 14.62 | 21.61 | 554.26 | 2388.08 | 9052.94 | 1.3 | 47.27 | 522.42 | 2388.03 | 8133.23 | 8.4178 | 0.03 | 390 | 2388 | 100.0 | 38.95 | 23.3442 | 518.67 | 642.106667 | 1589.836667 | 1402.646667 | 14.62 | 21.61 | 554.123333 | 2388.0600 | 9047.733333 | 1.3 | 47.410 | 522.120 | 2388.040 | 8134.446667 | 8.423033 | 0.03 | 391.333333 | 2388.0 | 100.0 | 39.003333 | 23.395600 | 0.0 | 0.267644 | 1.918654 | 1.850009 | 0.0 | 0.0 | 0.327159 | 0.020000 | 4.632023 | 0.0 | 0.121655 | 0.404475 | 0.026458 | 3.717450 | 0.007640 | 0.0 | 1.154701 | 0.0 | 0.0 | 0.055076 | 0.044573 | 189 | 0 | 0 |
| 3 | 1 | 4 | 0.0007 | 0.0000 | 100.0 | 518.67 | 642.35 | 1582.79 | 1401.87 | 14.62 | 21.61 | 554.45 | 2388.11 | 9049.48 | 1.3 | 47.13 | 522.86 | 2388.08 | 8133.83 | 8.3682 | 0.03 | 392 | 2388 | 100.0 | 38.88 | 23.3739 | 518.67 | 642.167500 | 1588.075000 | 1402.452500 | 14.62 | 21.61 | 554.205000 | 2388.0725 | 9048.170000 | 1.3 | 47.340 | 522.305 | 2388.050 | 8134.292500 | 8.409325 | 0.03 | 391.500000 | 2388.0 | 100.0 | 38.972500 | 23.390175 | 0.0 | 0.250117 | 3.855909 | 1.559645 | 0.0 | 0.0 | 0.313103 | 0.029861 | 3.881555 | 0.0 | 0.171659 | 0.495950 | 0.029439 | 3.050906 | 0.028117 | 0.0 | 1.000000 | 0.0 | 0.0 | 0.076322 | 0.037977 | 188 | 0 | 0 |
| 4 | 1 | 5 | -0.0019 | -0.0002 | 100.0 | 518.67 | 642.37 | 1582.85 | 1406.22 | 14.62 | 21.61 | 554.00 | 2388.06 | 9055.15 | 1.3 | 47.28 | 522.19 | 2388.04 | 8133.80 | 8.4294 | 0.03 | 393 | 2388 | 100.0 | 38.90 | 23.4044 | 518.67 | 642.208000 | 1587.030000 | 1403.206000 | 14.62 | 21.61 | 554.164000 | 2388.0700 | 9049.566000 | 1.3 | 47.328 | 522.282 | 2388.048 | 8134.194000 | 8.413340 | 0.03 | 391.800000 | 2388.0 | 100.0 | 38.958000 | 23.393020 | 0.0 | 0.234776 | 4.075678 | 2.159440 | 0.0 | 0.0 | 0.286234 | 0.026458 | 4.587366 | 0.0 | 0.151063 | 0.432574 | 0.025884 | 2.651326 | 0.025953 | 0.0 | 1.095445 | 0.0 | 0.0 | 0.073621 | 0.033498 | 187 | 0 | 0 |
train = fl.remove_columns_with_same_min_max(train)
# train.rename(columns=parameters_dict, inplace=True)
fl.custom_describe(train)
| | mean | std | min | 25% | 50% | 75% | max | count | zero_count | nan_count | skewness | kurtosis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| id | 51.506568 | 29.227633 | 1.0000 | 26.000000 | 52.000000 | 77.000000 | 100.000000 | 20631.0 | 0.0 | 0.0 | -0.067810 | -1.219819 |
| cycle | 108.807862 | 68.880990 | 1.0000 | 52.000000 | 104.000000 | 156.000000 | 362.000000 | 20631.0 | 0.0 | 0.0 | 0.499868 | -0.218777 |
| setting1 | -0.000009 | 0.002187 | -0.0087 | -0.001500 | 0.000000 | 0.001500 | 0.008700 | 20631.0 | 413.0 | 0.0 | -0.024764 | -0.009420 |
| setting2 | 0.000002 | 0.000293 | -0.0006 | -0.000200 | 0.000000 | 0.000300 | 0.000600 | 20631.0 | 2070.0 | 0.0 | 0.009084 | -1.130464 |
| s2 | 642.680934 | 0.500053 | 641.2100 | 642.325000 | 642.640000 | 643.000000 | 644.530000 | 20631.0 | 0.0 | 0.0 | 0.316503 | -0.112307 |
| s3 | 1590.523119 | 6.131150 | 1571.0400 | 1586.260000 | 1590.100000 | 1594.380000 | 1616.910000 | 20631.0 | 0.0 | 0.0 | 0.308923 | 0.007469 |
| s4 | 1408.933782 | 9.000605 | 1382.2500 | 1402.360000 | 1408.040000 | 1414.555000 | 1441.490000 | 20631.0 | 0.0 | 0.0 | 0.443162 | -0.163932 |
| s6 | 21.609803 | 0.001389 | 21.6000 | 21.610000 | 21.610000 | 21.610000 | 21.610000 | 20631.0 | 0.0 | 0.0 | -6.916310 | 45.835345 |
| s7 | 553.367711 | 0.885092 | 549.8500 | 552.810000 | 553.440000 | 554.010000 | 556.060000 | 20631.0 | 0.0 | 0.0 | -0.394300 | -0.158202 |
| s8 | 2388.096652 | 0.070985 | 2387.9000 | 2388.050000 | 2388.090000 | 2388.140000 | 2388.560000 | 20631.0 | 0.0 | 0.0 | 0.479376 | 0.332777 |
| s9 | 9065.242941 | 22.082880 | 9021.7300 | 9053.100000 | 9060.660000 | 9069.420000 | 9244.590000 | 20631.0 | 0.0 | 0.0 | 2.555179 | 9.376118 |
| s11 | 47.541168 | 0.267087 | 46.8500 | 47.350000 | 47.510000 | 47.700000 | 48.530000 | 20631.0 | 0.0 | 0.0 | 0.469295 | -0.172441 |
| s12 | 521.413470 | 0.737553 | 518.6900 | 520.960000 | 521.480000 | 521.950000 | 523.380000 | 20631.0 | 0.0 | 0.0 | -0.442375 | -0.145172 |
| s13 | 2388.096152 | 0.071919 | 2387.8800 | 2388.040000 | 2388.090000 | 2388.140000 | 2388.560000 | 20631.0 | 0.0 | 0.0 | 0.469758 | 0.386859 |
| s14 | 8143.752722 | 19.076176 | 8099.9400 | 8133.245000 | 8140.540000 | 8148.310000 | 8293.720000 | 20631.0 | 0.0 | 0.0 | 2.372381 | 8.852228 |
| s15 | 8.442146 | 0.037505 | 8.3249 | 8.414900 | 8.438900 | 8.465600 | 8.584800 | 20631.0 | 0.0 | 0.0 | 0.388230 | -0.121691 |
| s17 | 393.210654 | 1.548763 | 388.0000 | 392.000000 | 393.000000 | 394.000000 | 400.000000 | 20631.0 | 0.0 | 0.0 | 0.353100 | -0.039455 |
| s20 | 38.816271 | 0.180746 | 38.1400 | 38.700000 | 38.830000 | 38.950000 | 39.430000 | 20631.0 | 0.0 | 0.0 | -0.358419 | -0.113093 |
| s21 | 23.289705 | 0.108251 | 22.8942 | 23.221800 | 23.297900 | 23.366800 | 23.618400 | 20631.0 | 0.0 | 0.0 | -0.350349 | -0.117302 |
| av2 | 642.668177 | 0.410199 | 641.4500 | 642.366000 | 642.624000 | 642.918000 | 644.020000 | 20631.0 | 0.0 | 0.0 | 0.479614 | -0.208867 |
| av3 | 1590.371974 | 4.864211 | 1573.0200 | 1586.869000 | 1589.760000 | 1593.192000 | 1608.070000 | 20631.0 | 0.0 | 0.0 | 0.560107 | -0.017956 |
| av4 | 1408.670908 | 8.032418 | 1387.3800 | 1402.633000 | 1407.712000 | 1413.483000 | 1432.984000 | 20631.0 | 0.0 | 0.0 | 0.532997 | -0.273388 |
| av6 | 21.609798 | 0.000734 | 21.6000 | 21.610000 | 21.610000 | 21.610000 | 21.610000 | 20631.0 | 0.0 | 0.0 | -4.610210 | 29.020033 |
| av7 | 553.392492 | 0.787783 | 550.8100 | 552.934000 | 553.460000 | 553.996000 | 555.370000 | 20631.0 | 0.0 | 0.0 | -0.479222 | -0.266622 |
| av8 | 2388.094887 | 0.064036 | 2387.9200 | 2388.046000 | 2388.092000 | 2388.134000 | 2388.326000 | 20631.0 | 0.0 | 0.0 | 0.522656 | 0.100060 |
| av9 | 9064.827207 | 20.825535 | 9027.6520 | 9053.788000 | 9060.288000 | 9068.334000 | 9223.844000 | 20631.0 | 0.0 | 0.0 | 2.592280 | 9.451631 |
| av11 | 47.533241 | 0.244682 | 46.9300 | 47.348000 | 47.510000 | 47.676000 | 48.274000 | 20631.0 | 0.0 | 0.0 | 0.514913 | -0.277990 |
| av12 | 521.434437 | 0.669201 | 519.2240 | 521.046000 | 521.488000 | 521.944000 | 522.970000 | 20631.0 | 0.0 | 0.0 | -0.499478 | -0.251130 |
| av13 | 2388.094344 | 0.064908 | 2387.9300 | 2388.046000 | 2388.092000 | 2388.132000 | 2388.338000 | 20631.0 | 0.0 | 0.0 | 0.492525 | 0.058653 |
| av14 | 8143.460137 | 18.072103 | 8103.4660 | 8133.375000 | 8140.468000 | 8147.797000 | 8281.152000 | 20631.0 | 0.0 | 0.0 | 2.370463 | 8.766225 |
| av15 | 8.441132 | 0.032072 | 8.3303 | 8.416830 | 8.437720 | 8.459970 | 8.542420 | 20631.0 | 0.0 | 0.0 | 0.537446 | -0.210313 |
| av17 | 393.169949 | 1.263436 | 390.0000 | 392.200000 | 393.000000 | 393.800000 | 397.600000 | 20631.0 | 0.0 | 0.0 | 0.540343 | -0.121566 |
| av20 | 38.821209 | 0.152667 | 38.3260 | 38.730000 | 38.838000 | 38.934000 | 39.270000 | 20631.0 | 0.0 | 0.0 | -0.492091 | -0.209058 |
| av21 | 23.292606 | 0.091651 | 23.0114 | 23.238740 | 23.301540 | 23.360810 | 23.534900 | 20631.0 | 0.0 | 0.0 | -0.502557 | -0.210533 |
| sd2 | 0.282887 | 0.105588 | 0.0000 | 0.207437 | 0.276622 | 0.349614 | 0.783913 | 20631.0 | 102.0 | 0.0 | 0.343976 | 0.120458 |
| sd3 | 3.743819 | 1.405689 | 0.0000 | 2.756107 | 3.669396 | 4.637603 | 12.445079 | 20631.0 | 100.0 | 0.0 | 0.349888 | 0.305709 |
| sd4 | 3.772323 | 1.402446 | 0.0000 | 2.784935 | 3.703867 | 4.677819 | 15.259364 | 20631.0 | 100.0 | 0.0 | 0.337459 | 0.469149 |
| sd6 | 0.000387 | 0.001291 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.007071 | 20631.0 | 18912.0 | 0.0 | 3.066845 | 7.562016 |
| sd7 | 0.379234 | 0.145381 | 0.0000 | 0.276695 | 0.369486 | 0.473777 | 1.350444 | 20631.0 | 100.0 | 0.0 | 0.358557 | 0.274632 |
| sd8 | 0.028255 | 0.010881 | 0.0000 | 0.020736 | 0.027749 | 0.035071 | 0.200699 | 20631.0 | 110.0 | 0.0 | 0.930979 | 8.738667 |
| sd9 | 3.946680 | 1.513374 | 0.0000 | 2.884196 | 3.824288 | 4.899879 | 12.694602 | 20631.0 | 100.0 | 0.0 | 0.430050 | 0.365064 |
| sd11 | 0.095131 | 0.035433 | 0.0000 | 0.070000 | 0.092844 | 0.118025 | 0.247245 | 20631.0 | 101.0 | 0.0 | 0.330156 | 0.128424 |
| sd12 | 0.281756 | 0.105403 | 0.0000 | 0.207244 | 0.274918 | 0.348482 | 0.898026 | 20631.0 | 101.0 | 0.0 | 0.374051 | 0.309839 |
| sd13 | 0.028451 | 0.011049 | 0.0000 | 0.020736 | 0.027749 | 0.035355 | 0.220068 | 20631.0 | 118.0 | 0.0 | 1.150325 | 12.700276 |
| sd14 | 3.016360 | 1.143912 | 0.0000 | 2.219386 | 2.939243 | 3.715888 | 9.919746 | 20631.0 | 100.0 | 0.0 | 0.489364 | 0.735085 |
| sd15 | 0.018802 | 0.007103 | 0.0000 | 0.013724 | 0.018365 | 0.023380 | 0.065761 | 20631.0 | 100.0 | 0.0 | 0.371387 | 0.347628 |
| sd17 | 0.885173 | 0.341989 | 0.0000 | 0.547723 | 0.836660 | 1.140175 | 2.828427 | 20631.0 | 460.0 | 0.0 | 0.245099 | 0.546409 |
| sd20 | 0.094636 | 0.034867 | 0.0000 | 0.070071 | 0.092736 | 0.117132 | 0.325269 | 20631.0 | 103.0 | 0.0 | 0.303100 | 0.154156 |
| sd21 | 0.056483 | 0.020669 | 0.0000 | 0.041825 | 0.055555 | 0.069839 | 0.187313 | 20631.0 | 100.0 | 0.0 | 0.326198 | 0.445241 |
| ttf | 107.807862 | 68.880990 | 0.0000 | 51.000000 | 103.000000 | 155.000000 | 361.000000 | 20631.0 | 100.0 | 0.0 | 0.499868 | -0.218777 |
| label_bnc | 0.150259 | 0.357334 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 20631.0 | 17531.0 | 0.0 | 1.957547 | 1.831991 |
| label_mcc | 0.227813 | 0.575358 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 20631.0 | 17531.0 | 0.0 | 2.389476 | 4.237983 |
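`fl.custom_describe` comes from a helper module that is not shown, so the following is only a sketch of how a summary with the same columns could be built from pandas built-ins. The name `describe_extended` is hypothetical, and pandas' `skew` and `kurt` (bias-corrected, excess kurtosis) may not be the exact estimators used for the table above.

```python
import pandas as pd

def describe_extended(df):
    """Summary statistics per numeric column, plus zero/NaN counts and shape measures."""
    num = df.select_dtypes(include='number')
    cols = ['mean', 'std', 'min', '25%', '50%', '75%', 'max', 'count']
    stats = num.describe().T[cols].copy()
    stats['zero_count'] = (num == 0).sum()   # NaN compares as False, so it is not counted
    stats['nan_count'] = num.isna().sum()
    stats['skewness'] = num.skew()
    stats['kurtosis'] = num.kurt()
    return stats

# Hypothetical demo data
demo = pd.DataFrame({'a': [0, 1, 2, 3], 'b': [1.0, None, 0.0, 0.0]})
summary = describe_extended(demo)
print(summary)
```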
# train.select_dtypes(include='number').columns.to_list()
col_to_analyze = ['setting1','setting2','s2','s3','s4','s6','s7','s8','s9',
's11','s12','s13','s14','s15','s17','s20','s21', 'ttf']
# col_to_analyze = ['cycle',
# 'setting1',
# 'setting2',
# '(LPC outlet temperature) (◦R)',
# '(HPC outlet temperature) (◦R)',
# '(LPT outlet temperature) (◦R)',
# '(bypass-duct pressure) (psia)',
# '(HPC outlet pressure) (psia)',
# '(Physical fan speed) (rpm)',
# '(Physical core speed) (rpm)',
# '(HPC outlet Static pressure) (psia)',
# '(Ratio of fuel flow to Ps30) (pps/psia)',
# '(Corrected fan speed) (rpm)',
# '(Corrected core speed) (rpm)',
# '(Bypass Ratio) ',
# '(Bleed Enthalpy)',
# '(High-pressure turbines Cool air flow)',
# '(Low-pressure turbines Cool air flow)', 'ttf']
We can see that the values of several sensors are stable and do not change. We can remove these features.
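A minimal sketch of such a filter, assuming that "stable" means the minimum equals the maximum (as the name `fl.remove_columns_with_same_min_max` suggests). `drop_constant_columns` is a hypothetical stand-in, not the helper's actual code.

```python
import pandas as pd

def drop_constant_columns(df):
    """Drop numeric columns whose minimum equals their maximum."""
    num = df.select_dtypes(include='number')
    constant = [c for c in num.columns if num[c].min() == num[c].max()]
    return df.drop(columns=constant)

# Hypothetical demo: s1 never changes, s6 changes only slightly
demo = pd.DataFrame({
    'id': [1, 1, 2],
    's1': [518.67, 518.67, 518.67],
    's2': [641.82, 642.15, 643.02],
    's6': [21.61, 21.61, 21.60],
})
filtered = drop_constant_columns(demo)
print(filtered.columns.tolist())
```

A near-constant sensor such as s6, which only ranges from 21.60 to 21.61 in the summary table above, passes this filter. Removing it would need a variance threshold or a manual decision.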
fl.plot_histograms(train[col_to_analyze])
# train['s9'].mean()
# Log transformation
# for col in ['s6', 's9', 's14']:
# train[col] = train[col].apply(np.log)
# test[col] = test[col].apply(np.log)
['s12', 's7', 's21', 's20', 's6', 's14', 's9', 's13', 's8', 's3', 's17', 's2', 's15', 's4', 's11'] could be targets for feature selection during modeling, since their correlation with TTF is higher than that of the other features.
Let us display this correlation in a heatmap.
There is a very high correlation (> 0.8) between some features: (s14, s9), (s11, s4), (s11, s7), (s11, s12), (s4, s12), (s8, s13), (s7, s12).
This may hurt the performance of some ML algorithms.
So some of the above features will be candidates for removal during feature selection, since some algorithms struggle with collinearity.
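One way to list such pairs programmatically is to scan the upper triangle of the absolute correlation matrix. This is a sketch on synthetic data; `high_corr_pairs` is a hypothetical helper, and the 0.8 threshold simply mirrors the one quoted above.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.8):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Synthetic demo: b is almost a copy of a, c is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    'a': a,
    'b': a + 0.1 * rng.normal(size=200),
    'c': rng.normal(size=200),
})
pairs = high_corr_pairs(demo)
print(pairs)
```

On the notebook's data, the call would be `high_corr_pairs(train[col_to_analyze])`. From each reported pair, one feature can be kept and the other considered for removal.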
# Plot the time series of one sensor for a random sample of engines
def plot_time_series(s):
    """Plot time series of a single sensor for 10 random sample engines.

    Args:
        s (str): The column name of the sensor to be plotted.

    Returns:
        None. The figure is displayed with plt.show().
    """
    fig, axes = plt.subplots(10, 1, sharex=True, figsize=(15, 15))
    fig.suptitle(s + ' time series / cycle', fontsize=15)
    #np.random.seed(12345)
    select_engines = np.random.choice(range(1, 101), 10, replace=False).tolist()
    for i, e_id in enumerate(select_engines):
        # Rows of the selected engine only
        df = train.loc[train.id == e_id, ['cycle', s]]
        axes[i].plot(df['cycle'], df[s])
        axes[i].set_ylabel('engine ' + str(e_id))
        axes[i].set_xlabel('cycle')
        #axes[i].set_title('engine ' + str(e_id), loc='right')
    #plt.tight_layout()
    plt.subplots_adjust(wspace=0, hspace=0)
    plt.show()
plot_time_series('s9')
3.2 Bivariate & Correlation Analysis¶
correl_featurs = ['setting1','setting2','s2','s3','s4','s6','s7','s8','s9','s11','s12','s13','s14','s15','s17','s20','s21']
correl_featurs_lbl = correl_featurs + ['ttf']
# col_to_analyze = ['cycle',
# 'setting1',
# 'setting2',
# '(LPC outlet temperature) (◦R)',
# '(HPC outlet temperature) (◦R)',
# '(LPT outlet temperature) (◦R)',
# '(bypass-duct pressure) (psia)',
# '(HPC outlet pressure) (psia)',
# '(Physical fan speed) (rpm)',
# '(Physical core speed) (rpm)',
# '(HPC outlet Static pressure) (psia)',
# '(Ratio of fuel flow to Ps30) (pps/psia)',
# '(Corrected fan speed) (rpm)',
# '(Corrected core speed) (rpm)',
# '(Bypass Ratio) ',
# '(Bleed Enthalpy)',
# '(High-pressure turbines Cool air flow)',
# '(Low-pressure turbines Cool air flow)', 'ttf']
# train.rename(columns=parameters_dict, inplace=True)
# correl_featurs_lbl = col_to_analyze
# Plot a clustered heatmap of the positive and negative correlations among the features and the regression label
cm = np.corrcoef(train[correl_featurs_lbl].values.T)
sns.set(font_scale=1.0)
# clustermap creates its own figure, so the size is passed to it directly
hm = sns.clustermap(cm, figsize=(10, 8), cbar=True, annot=True, fmt='.2f', annot_kws={'size': 8},
                    yticklabels=correl_featurs_lbl, xticklabels=correl_featurs_lbl, cmap="vlag")
hm.fig.suptitle('Features Correlation Heatmap')
plt.show()
# from pandas.plotting import scatter_matrix
# create a scatter matrix to display relationships and distributions among the features and the regression label
# scatter_matrix(train[correl_featurs_lbl], alpha=0.2, figsize=(20, 20), diagonal='kde')
[Scatter matrix output: 18 x 18 grid of pairwise plots for the columns in correl_featurs_lbl (setting1, setting2, the retained sensors, and ttf)]
<Axes: xlabel='s12', ylabel='s21'>,
<Axes: xlabel='s13', ylabel='s21'>,
<Axes: xlabel='s14', ylabel='s21'>,
<Axes: xlabel='s15', ylabel='s21'>,
<Axes: xlabel='s17', ylabel='s21'>,
<Axes: xlabel='s20', ylabel='s21'>,
<Axes: xlabel='s21', ylabel='s21'>,
<Axes: xlabel='ttf', ylabel='s21'>],
[<Axes: xlabel='setting1', ylabel='ttf'>,
<Axes: xlabel='setting2', ylabel='ttf'>,
<Axes: xlabel='s2', ylabel='ttf'>,
<Axes: xlabel='s3', ylabel='ttf'>,
<Axes: xlabel='s4', ylabel='ttf'>,
<Axes: xlabel='s6', ylabel='ttf'>,
<Axes: xlabel='s7', ylabel='ttf'>,
<Axes: xlabel='s8', ylabel='ttf'>,
<Axes: xlabel='s9', ylabel='ttf'>,
<Axes: xlabel='s11', ylabel='ttf'>,
<Axes: xlabel='s12', ylabel='ttf'>,
<Axes: xlabel='s13', ylabel='ttf'>,
<Axes: xlabel='s14', ylabel='ttf'>,
<Axes: xlabel='s15', ylabel='ttf'>,
<Axes: xlabel='s17', ylabel='ttf'>,
<Axes: xlabel='s20', ylabel='ttf'>,
<Axes: xlabel='s21', ylabel='ttf'>,
<Axes: xlabel='ttf', ylabel='ttf'>]], dtype=object)
3.3 Principal Component Analysis
# Scikit-learn modules required for PCA
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
train = pd.read_csv('train.csv')
col_to_analyze = ['setting1','setting2','setting3','s1','s2','s3','s4','s5','s6','s7','s8','s9','s10','s11',
's12','s13','s14','s15','s16','s17','s18','s19','s20','s21']
df_PCA = train[col_to_analyze]
target = train['label_mcc']
target2 = train['ttf']
# Standardization of the features (zero mean, unit variance)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_PCA)
# PCA instantiation with 8 components
n_components = 8
pca = PCA(n_components=n_components)
pca.fit(X_scaled)
PCA(n_components=8)
# Cumulative variance calculation
scree = (pca.explained_variance_ratio_*100).round(2)
scree_cum = scree.cumsum().round()
axe_x = np.arange(1, len(scree_cum)+1)
plt.figure(figsize=(10,6))
plt.bar(axe_x, scree, color='steelblue')
plt.plot(axe_x, scree_cum, c='red',marker='o')
plt.xlabel("Number of components retained")
plt.ylabel("Explained Variance (%)")
plt.title("Scree Plot - Optimal Component Number Detection")
plt.xticks(np.arange(1,9,1))
plt.show()
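The scree plot can be backed by a simple numeric rule. The sketch below runs on synthetic data rather than the notebook's `train.csv`, and picks the smallest number of components whose cumulative explained variance reaches a threshold. The 90% threshold and the helper name `n_components_for_variance` are illustrative assumptions, not choices made in this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def n_components_for_variance(X, threshold=0.90):
    """Hypothetical helper: smallest number of principal components whose
    cumulative explained-variance ratio reaches `threshold`."""
    X_scaled = StandardScaler().fit_transform(X)
    pca = PCA().fit(X_scaled)  # keep every component
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # First index where the cumulative ratio is >= threshold, turned into a count
    k = int(np.searchsorted(cumulative, threshold)) + 1
    # Guard against floating-point round-off when threshold is close to 1.0
    return min(k, len(cumulative)), cumulative

# Synthetic stand-in data: 6 features driven by 2 latent factors plus small noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
X_demo = latent @ rng.normal(size=(2, 6)) + 0.05 * rng.normal(size=(500, 6))

k, cumulative = n_components_for_variance(X_demo, threshold=0.90)
print(k, cumulative.round(3))
```

Applied to the notebook's data, the same helper could be called on `df_PCA` in place of `X_demo`. scikit-learn also accepts a float directly, as in `PCA(n_components=0.90)`, which keeps just enough components to exceed that fraction of the variance.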
# Heatmap of the feature weights in each principal component
pcs = pca.components_
x_axis_labels=['Component_1', 'Component_2', 'Component_3', 'Component_4', 'Component_5', 'Component_6',
'Component_7', 'Component_8']
fig, ax = plt.subplots(figsize=(20, 6))
sns.heatmap(pcs.T, vmin=-1, vmax=1, annot=True, cmap="coolwarm", fmt="0.2f", yticklabels=col_to_analyze,
xticklabels=x_axis_labels)
plt.title('Heatmap showing the composition of the different components of the PCA')
plt.show()
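A heatmap with many features is easier to read next to a short ranking. As a minimal sketch on synthetic data (the feature names `a`, `b`, `c` and the choice of the top two features are assumptions for illustration), the rows of `pca.components_` can be placed in a DataFrame and sorted by absolute weight:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic stand-in: 'a' and 'b' share one latent signal, 'c' is independent noise
signal = rng.normal(size=300)
X_demo = pd.DataFrame({
    'a': signal + 0.1 * rng.normal(size=300),
    'b': signal + 0.1 * rng.normal(size=300),
    'c': rng.normal(size=300),
})

pca_demo = PCA(n_components=2).fit(StandardScaler().fit_transform(X_demo))

# One column per component, one row per feature (same orientation as the heatmap)
weights = pd.DataFrame(pca_demo.components_.T,
                       index=X_demo.columns,
                       columns=['Component_1', 'Component_2'])

# For each component, list the two features with the largest absolute weight
top_features = {
    comp: weights[comp].abs().sort_values(ascending=False).index[:2].tolist()
    for comp in weights.columns
}
print(top_features)
```

With the notebook's objects, `pca.components_.T`, `col_to_analyze` and `x_axis_labels` would take the place of the synthetic inputs.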
X_proj = pca.transform(X_scaled)
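# Map each class label to a fixed color so the colors do not depend on the order in which labels appear
palette_colors = {"Normal condition": "green", "Low risk of failure": "orange", "High risk of failure": "red"}
legend_labels = {0: "Normal condition", 1: "Low risk of failure", 2: "High risk of failure"}
target_mapped = [legend_labels[t] for t in target]
plt.figure(figsize=(8,8))
sns.scatterplot(x=X_proj[:,0], y=X_proj[:,1], alpha=0.2, hue=target_mapped, palette=palette_colors,
hue_order=list(palette_colors))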
plt.show()
import matplotlib.colors as mcolors
# Ensure target2 is a continuous variable
# ...
plt.figure(figsize=(8, 8))
# Create the scatter plot
scatter = sns.scatterplot(
x=X_proj[:, 0],
y=X_proj[:, 1],
alpha=0.2,
hue=target2,
palette='magma'
)
# Create a normalizer object which will map the values of target2 to the range [0, 1]
norm = mcolors.Normalize(vmin=target2.min(), vmax=target2.max())
# Create a ScalarMappable and initialize with the norm object and the chosen colormap
sm = plt.cm.ScalarMappable(cmap="magma", norm=norm)
sm.set_array([])
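# Add the color bar using the ScalarMappable; the Axes is passed explicitly because recent Matplotlib versions require it
plt.colorbar(sm, ax=scatter, label='ttf')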
plt.show()
# Build a DataFrame for Plotly from the first three principal components
import plotly.express as px
df = pd.DataFrame({
'Component 1': X_proj[:, 0],
'Component 2': X_proj[:, 1],
'Component 3': X_proj[:, 2],
# Cast the class to string so Plotly treats it as discrete and applies color_discrete_map
'Target': target.astype(int).astype(str).values
})
color_discrete_map = {'0': 'rgb(255,0,0)', '1': 'rgb(0,255,0)', '2': 'rgb(0,0,255)'}
# Create the 3D scatter plot
fig = px.scatter_3d(df, x='Component 1', y='Component 2', z='Component 3',
color='Target',
color_discrete_map=color_discrete_map,
opacity=0.5)
# Update the legend and axis titles
fig.update_layout(legend_title="label_mcc",
scene=dict(
xaxis_title='Component 1',
yaxis_title='Component 2',
zaxis_title='Component 3'
))
fig.update_traces(marker_size=2)
fig.update(layout_coloraxis_showscale=False)
fig.show()
3.4 EDA Summary:
- Some feature pairs are very highly correlated (> 0.8), e.g. (s14 & s9), (s11 & s4), (s11 & s7), (s11 & s12), (s4 & s12), (s8 & s13), (s7 & s12). This multicollinearity may hurt the performance of some machine learning algorithms, so some of these features are candidates for elimination during feature selection in the modeling phase.
- Most features have a nonlinear relationship with TTF, so adding their polynomial transforms may improve model performance.
- Most features are approximately normally distributed, which is likely to help models that work best with Gaussian-like inputs.
- ROC AUC should be used instead of accuracy to evaluate the classification models, because the classes in the training data are imbalanced.
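The first point above can be turned into a mechanical check. The following is a minimal sketch, assuming a 0.8 threshold and a simple "drop the later column of each correlated pair" rule. The helper names `correlated_pairs` and `drop_correlated` and the synthetic columns are hypothetical, and this is not necessarily the selection method used later in the modeling phase.

```python
import numpy as np
import pandas as pd

def _upper_abs_corr(df):
    """Absolute correlation matrix with the diagonal and lower triangle masked out."""
    corr = df.corr().abs()
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    return corr.where(mask)

def correlated_pairs(df, threshold=0.8):
    """List each feature pair once, with |correlation| above `threshold`."""
    pairs = _upper_abs_corr(df).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

def drop_correlated(df, threshold=0.8):
    """Drop the later column of every pair whose |correlation| exceeds `threshold`."""
    upper = _upper_abs_corr(df)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Synthetic stand-in: x2 is nearly a copy of x1, x3 is independent
rng = np.random.default_rng(1)
x1 = rng.normal(size=400)
demo = pd.DataFrame({
    'x1': x1,
    'x2': x1 + 0.05 * rng.normal(size=400),
    'x3': rng.normal(size=400),
})

pairs = correlated_pairs(demo)
reduced, dropped = drop_correlated(demo)
print(pairs)
print(dropped)
```

Which member of a correlated pair to keep is a judgment call. Keeping the earlier column is only a convention of this sketch, and domain knowledge about the sensors may suggest a different choice.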